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Abstract — This paper introduces new techniques for using convex 
optimization to fit input-output data to a class of stable nonlinear 
dynamical models. We present an algorithm that guarantees consistent 
estimates of models in this class when a small set of repeated experiments 
with suitably independent measurement noise is available. Stability of the 
estimated models is guaranteed without any assumptions on the input- 
output data. We first present a convex optimization scheme for identifying 
stable state-space models from empirical moments. Next, we provide a 
method for using repeated experiments to remove the effect of noise on 
these moment and model estimates. The technique is demonstrated on a 
simple simulated example. 

Index Terms — System identification, nonlinear systems. 



I. INTRODUCTION 

Building nonlinear dynamical models capable of accurate long 
term prediction is a common goal in system identification. However, 
for most model structures multi-step prediction errors have a com- 
plex nonlinear dependence on the model parameters. Furthermore, 
assuring stability of algorithmically generated nonlinear models is a 
substantial challenge. In many practical situations, where data-sets 
are limited or under-modeling is present, widely used "one-step" 
prediction error minimization techniques can render models that are 
unstable or have poor multi-step predictions. This work presents 
a convex optimization method for approximating the input-output 
response of a nonlinear dynamical system via state-space models with 
stability guarantees. This paper extends recent work in [ 30 1 , |3| and 
1 18 1 by providing a family of consistent estimators for a class of 
stable nonlinear models when a small set of repeated experiments is 
available. We examine the problem of embedding an input-output 
identification task inside a state-space modeling framework. We 
inherit from the methods of 1301 . 151 . 1181 an unqualified guarantee 
of model stability and a cost function that is a convex upper bound 
on the "simulation error" associated with these models. However, 
the estimators from |30|, [3|, |18| are generally not consistent, and 
for systems that are nearly marginally stable the biasing effect of 
measurement noise can be quite severe. Furthermore, the complexity 
of these methods grows undesirably with the number of data points. 

We present a modification of algorithms from 1 30 1 that mitigates 
these two difficulties. In particular, a technique that utilizes the 
problem data through empirical moments only is used. As a result, 
the complexity of the method generally grows linearly with data- 
set size. We also provide a method for asymptotically removing the 
effects of measurement noise on these empirical moments when a 
small set of repeated experiments are available, utilizing an idea 
which is superficially similar to instrumental variable methods |14|. 
We that demonstrate that this technique, a nonlinear extension of 1171 . 
recovers consistency when the data is generated by a system within 
a specific class of models. 
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A. Previous Work 

The use of maximum likelihood and one-step prediction error 
methods is frequently motivated by the consistency and asymptotic 
efficiency of the resulting estimators |T4|. In the face of limited data 
or significant under-modeling, these techniques often render models 
that are unstable or make poor multi-step ahead predictions (§). 
Direct minimization of longer term prediction errors have appeared 
in several forms, including the output-error method for input-output 
system identification, 1271 . notions of "best" approximation, |20|, 
and simulation error minimization, [4],|8|. These methods require 
optimization of a non-convex functional for all but the simplest model 
structures (e.g. finite impulse response and Volterra type models) and 
can suffer from local minima |27|. Appealing theoretical properties of 
these methods (e.g. efficiency and unbiasedness) are often predicated 
on finding global minima of generically hard nonlinear programming 
problems. 

Several results are available for linear time invariant (LTI) system 
identification using least squares that provide stability guarantees 
even in the face of under-modeling (e.g. |25|,|26|, [31 1). It is worth 
noting that these stability guarantees apply only as the number of 
available data points tends to infinity and requires an assumption 
that the data is generated by a (potentially under-modeled) stationary 
LTI process. Several modified subspace techniques have also been 
presented to address the issue of model stability. In 1321 regularization 
is used to ensure model stability. In 1131 and 1121 a joint search over 
Lyapunov function and system dynamics using convex optimization 
was used to ensure model stability. The LTI-specific method em- 
ployed by 1 13 1 and 1 12 ] is closely related to the technique by which 
this paper addresses stability. 

Several convex relaxation techniques have recently been employed 
by the Set Membership (SM) identification community to address 
fixed order identification of LTI systems (|7], (9)). In (7) outer 
approximations of the set of parameters consistent with bounded 
noise and stability assumptions are computed. In |9) a convex relax- 
ation approach is suggested for optimization of arbitrary polynomial 
objectives over the set of LTI models consistent with a given data- 
set and a set of stability and bounded noise assumptions. A similar 
approach is taken for identifying Linear Parameter Varying systems 
in |6|. By contrast, in this work we examine a "convex restriction" 
approach where inner approximations of the set of stable models 
are used to guarantee stability and convex upper bounds on the cost 
function of interest are used as a surrogate objective. 

B. Outline 

The paper proceeds as follows. Section [TT] presents the notation, 
problem setup, and a bias elimination strategy employed in this 
work. Next, Section [Hi] provides a convex parameterization of stable 
state-space models and a convex upper bound for simulation error. 
This parameterization and objective are then combined with the bias 
elimination strategy in Section |IV| wherein a system identification 
algorithm based on semidefinite programming is given along with 
asymptotic analysis of the method. Finally, a comparison of the 
proposed algorithm to two alternative least-squares based methods 
is provided in Section [V] 
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Fig. 1. The experimental setup considered in this work. 



II. Preliminaries 

In this section we introduce basic notation, and present the problem 
setup to be addressed in the paper. 



A. Notation 

C kxn stands for the set of all fc-by-n complex matrices, with C n 
being a shorthand for C nxl . R kxn and R" are the subsets of real 
matrices from C kxn and C n respectively. Z™ is the subset of R™ 
whose elements are non-negative integers. 

We use some notation from MATLAB, where A', [A, B], and 
[A; B] denote, respectively, Hermitian conjugation, horizontal con- 
catenation, and vertical concatenation of matrices. For R G <£ kxn 
we denote by [iJ] tt ,& the scalar element in the a-th row and 6-th 
column of R, with the shorthand [v]d = Vd.i used for v G C". In 
addition, for v G C n and a G Z+ , 



is the monomial function of v with vector degree a, and scalar degree 

||a||i, where, for w G C", \\w\\i := Y17=l \l w ]i\ i s me ^ norm of 

w. For Hermitian matrices A,B G C nXTl (i.e. such that A = A' 

and B = B'), A > B (or A > B) means that A — B is, positive 

semidefinite (respectively, positive definite). For R = R' G C nxn 

and v G C n we use the shorthand \v\% = v'Rv. Moreover, when 

R > 0, we also write | u | je := Vv'Rv. When W is a set, £t(W) 

denotes the set of all functions w : {0, 1, . . . , T} — > W 7 . Naturally, 

the elements of ^(W 7 ) are finite length sequences of elements from 
p 

W. The notation — > refers to convergence in probability. 



5. Problem Setup 

We define a data set with N experiments of length T, 
n m -dimensional input, and n x -dimensional state as a collection 
(w,xi,...,xn) of sequences w G fr(R n "), & £ ^T(R nx )- 
T>(n x , n w , N, T) stands for the set of all data sets of given di- 
mensions, number of experiments, and signal length. Accordingly, 
V(n x ,n w ,N) = UT=o'D(n x ,n w , N,T) stands for the set of all 
data sets with unspecified signal length. 

In applications, each Xi(t) is the result of feeding the same input 
w(t) into a system, 5, and measuring the sum Xi(t) = Xi(t) + Vi(t), 
where Xi(t) is the "true system response" and Vi(t) is corrupting 
measurement noise. Additionally, in order to set a measure of quality 
for model predictions, we define an output signal yi (t) is defined by 
yi (t) = Cxi{t), for some fixed matrix C G M. n y xn *. Informally, the 
identification objective will be to accurately predict the input-output 
behavior of this a system with w(t) taken as input and y(t) taken as 
an output (alternatively C can be seen as weighting the importance 
certain components of Xi(t)). This experimental setup is depicted in 
Figure [T] 



The underlying assumption is that the collection of signals Xi(t) 
constitute a reasonable state, or reduced state, for a state-space model 
approximating the system behavior. As an example, when identifying 
a SISO system with input u — u(t) and output y = y(t), one can 
imagine feeding in N ■ D samples of a D-periodic input u(t) and 
measuring y(t) = y(t)+v v (t), where y(t) is the true system response 
and v y (t) is measurement noise. In this case, one could use the above 
setup with Tlx = n w — n < D by taking 
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for i G {0, . . . , N - 1} and t G {0, 
C G R lx " might be C= [1 .. 



, D — n}. Here, the matrix 
01. 



C. State Space Models 

In general, a nonlinear state space model (time invariant, in 
discrete time) with n w -dimensional input and n x -dimensional state 
is specified by a function a : R' la! x R"" — > R" x , which in turn 
defines the input-output function G a ■ R™" X £(R n ™) ~> £(R n ™) 
mapping initial state xo G R na! and input sequence w G ^(R n ™) to 
the output sequence x G i*(R nx ) according to 



s(t) = a(x(t - l),w(t)), x(0) = x . 



(1) 



For w : Z+ i-> R n ™, we define x = G a (xo,w) to be the 
sequence similarly defined by this recurrence. Let x — G a (xo,w), 
x = G a (xo,w) be two responses of system {T} to the same input 
w : Z + h-> R n ™ and different initial conditions xo,Xo. We call 
system ([T} £ 2 -incrementally stable when x — x is square summable 
for all xo, xo, w. The system ([TJ is incrementally exponentially stable 
if there exist constants c > and p G (0, 1), independent of xo,xo 
and w, such that \x{t) — x(t)\ < cp l \xo — xo\ for all Xq,xq, w and 
t > 0. 

This paper deals with subsets of state space models {T} which 
have more specific finite dimensional structure. For positive integers 
n x ,n w ,ne let G,&, 9 be a non-empty set O C R" e and two 
sequences $ = {0;}"^, ^ = {tpi}"^ of real analytical functions 
4>i : R™ 1 x R n ™ R"" , ipi : R"" R n - . We say that the 3-tuple 
(B, $, is a stable projective parameterization with n w inputs, n x 
states, and ng parameters when, for all 9 G G, 

. the function eg : R"" -> R" x defined by eg (a;) = 

Srii[^]*V'i( a; ) is a bijection; 
• the state space model {TJ with a — ag defined by 

ag(x,w) = eg 1 (fg(x,w)) (2) 

i=l <=1 

is £ 2 -incrementally stable. 
Once a stable projective parameterization (O, $, ']/) is selected, a 
stable state space model can be defined by specifying a vector 
parameter 9 G Q. 

Recent discussion and applications of incremental stability and the 
related notions of contractive and convergent systems can be found 
in |16|, |'1|, |29|, and |2|. This property is related to familiar 
"fading memory" conditions employed in other identification and 
system approximation papers (e.g. |5|, |22|, 123]), though we note 



3 



that the condition as defined above does not restrict the behavior of 
solutions with different input sequences. 

In practice, both the particular construction of the signals w(t) and 
Xi(t) from measureable quantities, and the selection of the sequences 
of functions >]/ and $ defining (eg, fg) should be guided by a model 
selection criteria such as cross-validation |14|. We consider both of 
these selections fixed for the remainder of the paper. 



D. Empirical Moments 

For given positive integers n x ,n w ,N let n z — 2n x + n w be the 
dimension of the vectors 



Zi(t) 



Xi(t) 
Xi(t-1) 

w(t) 



(tG {!,..., T}, ie{l,...,N}) (4) 



defined by the data set 5 = (w, xi, . . . , xn) G T>(n x ,n w ,N,T). 
For a G Z'r such that llalli < N and for zi, . . .,zn e R" J define 



Pa(zi, 



n M/M*). 



where 



de{l,...,N} 



> i 



By construction, z a — p a (z, . . . , z), so that one can view p a as a 
multi-linear function which generates the monomial z a when evalu- 
ated on the multi-diagonal (note that such multi-linear functions are 
not uniquely defined by a). For a given data set H = (w, x i , . . . , xn) 
and a £ define the linearized empirical moment jl a (S) by 



1 T 



,2jv(t)). 



(5) 



Since it is sometimes convenient to emphasize /i a (E) as a function 
of variable a with a fixed S, we will also use the equivalent notation 
/2 a (3) = fis(a). According to this notation, for a given data set 3 
with TV experiments, n x states, and n w inputs, fin is a real-valued 
function defined on the set of elements a 6 2^™ x+ni " such that 
||q||i < TV. 

Informally speaking, linearized empirical moments represent an 
attempt at "de-noising" the data contained in the vectors 
as defined by (4| in the case when Xi(t) = x(t) + Vi(t) for 
t G {0,1,..., T}, where x = x(t), the "true system response", 
does not depend on the experiment number i, and the noise variables 
Vi(t) are suitably independent of x and of each other, to produce 
good estimates /i a (H) of the standard empirical moments 



1 

fj, a (x,w) = y^z(i) 



2(t) 



«(*) 
S(f - 1) 
w(t) 



(6) 



This approach is inspired by instrumental variable (IV) techniques, 
1 28 1, with repeated experiments playing a role comparable to a spe- 
cific choice of instruments. Rather than asymptotically approximate a 
least squares parameter estimate, as in IV methods, this work focuses 
on asymptotically minimizing an alternative convex loss function 
that depends only on empirical moments. To have a meaningful 
convergence of the linearized empirical moments we require both 
the aforementioned independence of the noise sequences, to be made 
more precise shortly, and that the true system responses, Xi(t), tend 
to one another despite their differing initial conditions. 



E. Persistence of Excitation 

The following notion of persistence of excitation will be used in 
our consistency analysis. 

Definition 1: Fix two signals w : Z+ — > R n ™ and x : Z+ — > 
R" 1 , and let w^- T ' and x^ be the restriction of these signals to 
{0, . . . , T}. For a given function a : R % x R"" : -> l™ 1 , we say a 
pair of signals (w, x) is persistently exciting for a if there exists a 
positive measure -n on R' 1 * x R"" such that n is supported on an 
open set, and for every finite subset K = {ajjj^j of Z™ z 

liminf X mm (Mr - M?) > 0, 

where M|, G M |N|x|N| are defined by: 

r».rNi / (T) (T)x 

[M T ]i,j = fi ai+aj (w K \x y >), 

[M^]ij = / [a{x, w); x; w] at+c " J div(x, w). 



Informally, this non-standard notion of persistence of excitation 
will be employed to establish a connection between 

1 T 

-Y,\^Hx(t)Mt)))-fe(x{t)Mm 2 
t=l 

vanishing as T — > oo and a being equivalent to ae — e^ 1 o fg. The 
use of a projective representation, i.e. ae being implicitly defined, 
renders several complications to standard consistency arguments 
based on strong convexity (for example, the eg and fg that define 
ag can be non-unique). The above notion of persistence will be used 
to circumvent these difficulties. 

F. Data-Matching Error 

We examine the following loss function for identifying models. 

Definition 2: The T-step simulation error, Jy E , is a function of 
an a : R n * x R n,u -> R n ™ , an initial condition vector x G R™ 1 , 
and two signals w G fr(R""), and x G ^r(R na! ), defined by 

1 T_1 

4 E (a,x ,x,w) = -J2\C(m - ^(t))\ 2 , (7) 



where x — G a (xo,w). 



G. Data Generation Mechanism 

Two data generation mechanisms, defined by considering data 
sequences as stochastic processes, will be analyzed in this work. 
These mechanisms consider signals defined on an infinite horizon, 
i.e. w : Z+ -> R n ™ and Xi : Z+ -> R Ul ", for i G {1, . . . , N}. We 
express Si as the sum of two signals x% and Vi, again representing the 
true system response and measurement noise respectively. Let x\ , 
and iS' T ' be the restrictions to {0, . . . , T} of xt, and w respectively. 
Then we define the data set Ht G T>(n x , n w , N, T) by 



(w 



-IT) ~(T) 



The following assumptions define the first data generation mecha- 
nism. 

(Al) The signal w(t) is a stochastic process for t £ Z, which is 

uniformly bounded in t. 
(A2) The signals Vi(t) are i.i.d. zero mean bounded stochastic 

processes independent of one another, w(t) and each Xi(t). 
(A3) The signals Xi(t) are stochastic processes which are uniformly 

bounded in i and t. There exist constants c > and p G (0, 1) 

such that. 

\ Xi (t) - Xj (t)\ <cp\ Vi, j G {!,..., D},t G Z+, (8) 
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almost surely; 

An alternative, less general, data-generation mechanism in given 
by the assumptions (Al), (A2), and the following. 

(A4) There exists a function a : R" 1 x R"'" -s> K n * such that fj} 
with a = ao defines a BIBO and incrementally exponentially 
stable system and Xi = G ao (xio,w) for some unknown xm £ 
K" x , for i £ {1, ...,N}. The pairs of signals (w,Xi) are 
persistently exciting with respect to ao, as in Definition[T| with 
probability one. 

The appendix contains practical conditions on w(t) and ao that ensure 
(A4) holds. It is immediate that assumptions (Al) and (A4) together 
imply (A3). 

H. Identification Objective 

In this paper, we view system identification algorithms as 4-tuples 
(.A, 0, <&, where (0, $, \t) is a stable projective parameterization 
with n w inputs, n x states, and ng parameters, and A is a function 
A : T>(n x ,n w , N) — > mapping data sets to parameter vectors 
from 0. 

Specifically, we are interested in generating efficient moments- 
based system identification algorithms (.4, 0, <!>, ^f), i.e. those for 
which the function A : T>(n x , n w , TV) — > has the form «A(H) = 
A (p.z), which means that the resulting identified model is a function 
of the linearized empirical moments /is (a) with a £ <^"x+n w 
satisfying ||q||i < TV. 

The main contribution of this paper is the construction of moments- 
based system identification algorithms (A, 0, ^) and sets 0o C 
with the following properties: 

(a) the set a@ = {ag : 9 £ 0o} of models ([TJ generated by 
0o is sufficiently broad, in the sense that every stable linear 
state space model a(x,w) — Ax + Bw is in ae , and some 
non-linear functions are contained in ae as well; 

(b) when a sequence of data sets {Ht}t=i is generated by signals 
(w,xi, . . . ,xn) satisfying assumptions (A1),(A2) and (A3), 
then 9t = A(5t) asymptotically (with respect to T) minimizes 
an upper bound for 

1 N 

1 rSE, ~ — (T) ~(T)n , m 

i=l 

amongst all 9 £ 0. 

(c) when a sequence of data sets {Ht}t=i is generated by signals 
(w, xi, . . . , xn) satisfying assumptions (A1),(A2) and (A4) for 
some ao £ ae , then for 9t = A(St) the convergence of 
ag T (x,w) to ao(x,w) takes place uniformly on every compact 
subset compact subset of R na! xi"". 

III. Convex Parameterization of Models 

In this section we introduce the main construction of this paper: 
a special class of stable projective parameterizations (0, $, ^), in 
which are convex sets defined by a family of linear matrix inequal- 
ities arrived at via an application of the sum-of-squares relaxation, 
1211 . The construction is motivated by the earlier approaches from 
t3l . 1301 , and 1181 . and is intended to improve consistency of the 
associated system identification algorithms. 

In the following definition, x, £, A and q are real vector variables 
of dimensions n x , and w is a real vector variable of dimension 
n w . In addition, z — [£; x; w] and v = [£; x; w; A; q] are the real 
vector variables of dimensions 2n x + n w and 4n x + n w respectively, 
constructed by concatenating £, x, w, A, and g. Given an positive 



integer N let 

Pn = < p(z) = ^2 c ^ zOL '■ C »£ R 

I a6Z" z | ||a||i<iV J 

denote the set of all polynomials composed of monomials with scalar 
degrees no greater than N. 

Given a positive integer TV, a positive constant S, and a function 
II : R 4 "-+' l » -> K"n ; i et @(jv, fl 5 n) be the set of all pairs (9, r) 
of vectors 9 £ M, n " and r £ Vn for which there exist matrices 
P = P' £ R"- x "-, E» = E'i £ R"nx™n (for j g {1,2}), and a 
positive scalar e such that 

P > SI, Ei > 0, E 2 > 0, (10) 

r(z) + 2A'[e e (x + A) - e e (x)] - \A\ 2 P+C , C (11) 
+\q\% - 2q'(f g (x + A, w) - e e (0) = H{v)'^iU(v), 

2A'[e e (x + A)-e (x)]-\A\ 2 P+tI (12) 
+\q\%- 2q'(f s (x + A, w) - fg(x, w)) = U{v)'S 2 U(v), 

where eg and fg are defined by |4). By construction, ©(TV, 8, LI) is 
a convex set defined by a family of linear matrix inequalities. 
Remark 1: The purpose of \ \\\ is to establish the condition 

r(z) + \eg(x + A) — eg(x)\p-i (13) 
> 

\fe(x + A,w)-e e (0\ 2 P -i + \CA\ 2 , 

which in turn serves as a dissipation inequality used to bound 
simulation error when using model {TJ with a = ag. The purpose 
of < | 1 2.| > is to ensure that eg is a bijection and establish the condition 

\eg(x+A) - eg(x)\ 2 P -i (14) 
> 

\fg(x + A,w)-fg(x,w)\ 2 p - 1 +t|A| 2 , 

which is a non-linear version of the Lyapunov inequality, used to 
prove that the model <|TJ with a = ag is I 2 -incrementally stable. 

The following statement explains, partially, the utility of this 
construction. 

Lemma 1: If is the set of all 9 £ R' lfl such that (0, r) £ 
©(TV, 5, II) for some r, then (0, <!>, , 5) is a stable projective pa- 
rameterization. Furthermore, for each (9,r) £ ©(TV, 5, II) and data 
set S = (w, xi, ■ ■ . , x n ) £ T>(n x ,n w , TV, T) the function 

1 N T 

^-iEE^)) (15) 

t=i i=i 

satisfies J r (H) > ~ J2f=i Jr E { a B, 5>(0), Xi, w). 

Proof. For as defined above to be a valid projective parameteriza- 
tion requires that eg be a bijection for all 9 £ 0. The equality ( |12| l 
holding for some P > and £2 > implies that 

2A'{e B (x + A)~eg{x)) > e|A| 2 (16) 

holds for all a;, A £ K™. As eg is continuous, this condition implies 
eg is a bijection ( 1241 . Theorem 18.15). 

Next, we establish the connection between the conditions |TTJ and 
{12} and the inequalities |13]( and l[T4j. For all a,b £ R"* and 
symmetric, positive definite P £ K n ^ xna = ! the inequality 

\a\ 2 P -i > 2b' a - \b\ 2 P (17) 
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holds due to the fact that \a — Pb|p_i > 0. Fixing any (9, r) G 
©(iV, <5, II), and applying {T7} with a = eg(x + A) — ee(x) and 
b = A, we see that there exists a P G R na = xna = such that: 

r(z) + \e g (x + A) - e e (x)|p-i - CA| 2 



+| ? ||.-2g'(/ 9 (a: + A J w)-e e (e))>0 I 



and 



\e 9 {x + A) - e 9 (x)\%-i -e\A\ 2 
+ \q\% - 2q' (f e {x + A,w) - f e (x,w)) > 0, 

hold for all x, £, A, q in R™ x and w G K"™ . Analytically minimizing 
these expressions with respect to q demonstrates these inequalities 
imply {TJ} and {14} hold for all i,(,Ac R' li ' and w G R n ™ . 

Fix xoi , X02 G R' la! and w : Z+ h-> R n ™ , and let a;, be the solution 
Xi — G ag (xoi,w) for i G {1,2}. The inequality {14} with x = xi(t) 
and A = X2(t) — x\(t) implies: 

\ee{x 02 ) - e e (a; i)|p-i > \e e (x 2 (T)) - efl(asi(T))|p-i 

T-l 

+ e^>i(t)-a; 2 (t)[ 2 . 

i=0 

As P > 0, we conclude that {T} with a = ag is ^ 2 -incrementally 
stable. Take = G ag (xi(0),w), then examining {13} with z — Zi(t) 
and A — Xi (t) — Xi (t) leads to 

T T-l 

> |e e ( 2; i(T))-ea(^(T))|J_ 1 +^ |C(*<(t)-5i< W)| 2 , 
t=i t=o 

from which one can readily conclude J r (H) > Jy B (ae,S). □ 

The following definition provides a subset of systems for which 
we can establish consistency results. 

Definition 3: The recoverable set defined by &{N, 8, II) is the 
set of functions a : R" x x R"™ h-> R' lx such that there exists a pair 
(6>, r) G &(N, 8, n) with a(a;, w) = oe(x, w) and 



r([a(a;, w); x; w]) = 



(18) 



for all (x,w) G R nx x R n ™. 

The following lemma establishes that the subset of a& consisting 
of recoverable models can be made to include all stable, linear state- 
space models of appropriate dimensionality. 

Lemma 2: Let $ and ^ be finite sequences of real analytic func- 
tions, as above, whose respective spans include all linear functions, 
and let II : R 4n *+ n ™ i_> R n n ^ e a f unc ti on such that the span of its 
components include all linear functions. Then for all C G R n H xna: 
and N > 2 the recoverable set defined by ®(N,S,T1) includes all 
functions a : R™* x R n ™ M> R"* given by 

a(x,w) = Ax + Bw, (19) 

where A G R n - Xn - is Schur (stable) and B G R" lX ™°'. 

Proof. As A is Schur, there exists a symmetric positive definite 
matrix P solving the Lyapunov equation P — A 1 PA — C'C + 81 
for any positive 8. Choose 9 such that: 

eg(x) = Px, fe(x,w) = Pa(x,w). 

The constraint {TT} is therefore equivalent to 

r(z) + A'PA + \q\l 
+2q'(PAA - e(£) + /(as, io)) - |CA| 2 = n(«)%n(¥). 



Explicit minimization w.r.t. q shows that the polynomial on the left 
hand side of this equality is lower bounded by: 

r{z) + A'PA- \PAA - e(£) + f(x,w)\%-i - CAj 2 . 



This is a concave function of A as A'(P - A' PA - C"C)A = 
— <5|A| 2 . Explicit minimization w.r.t A provides 

r(z)-\e(0-f(x,w)\ 2 Q 

as a lower bound for the original polynomial for some Q = Q' > 0. 
The function r(z) = |e(£) — f(x,w)\Q, belongs to Vn and clearly 
satisfies {18} , Furthermore, with this choice of r(z) the left hand side 
of {TT} is a non-negative quadratic polynomial so that there exists an 
appropriate choice of Ei > to ensure {TT} holds. A similar analysis 
shows that an appropriate choice of E2 > also exists, thus a belongs 
to the recoverable set. □ 



A simple example of a nonlinear function a : I 
to such a recoverable set is given by 

e(a(x,w)) = ^x + b(w) 



. belonging 



where e(x) = |a; + x 3 and b 



is an arbitrary polynomial. 



That an appropriate recoverable set exists is shown in the appendix. 

IV. Identification Algorithm 

This section presents an algorithm for transforming data sets 
H = (w,xi, . . . ,xn) G T>(n x ,n w , N) into parameter vectors 
9 G I™ 9 , followed by an asymptotic analysis of the algorithm. For 
the remainder of this section we define 



N = {a G 



2||o|U <N}. 



Algorithm A.(S, II, K): 

(i) Select a constant 8 > and a function II : R 4rl =+"«> _» R"n ; 
as described in Section [In] Additionally, select a constant k G 

(0,oo]. 

(ii) Form the matrix M H G R |N|x|N| given by: 

where fl a (-) are the linearized empirical moments defined by 
{5}, and let M s be the projection of |(A/s + Mb) on t° the 
closed convex cone of positive semidefinite matrices. 

(iii) Find the 9 G I" 9 , r G Vn and R = R' G Rl N l x l N l that 
minimize: 

tr(RMs) 
subject to (9,r) G @(N,8,Il), 

ji=i 32=1 

and < k. Take (9,f,R) to be an optimizing (9,r,R). 

Remark 2: Note that the algorithm is well defined if any subset of 
the set of vector degrees K is substituted in lieu of H. 

Remark 3: Examining the definition of R and M, one sees that 
when xi = X2 = ■ ■ ■ = ijv, Ms = Ms and the objective function 
tr(i?Ma) is equal to Jf (H), the previously established upper bound 
on simulation error. The additional parameter k, when finite, ensures 
that the optimal value of the optimization problem of step (iii) has 
a continuous dependence on Ms (and by extension, the linearized 
empirical moments). 



A. Asymptotic Analysis 

This section analyzes the properties of algorithm A when data sets 
are generated according to one of the two data generation mechanisms 
described in Section II-G By (w, x\, ... , xn), (vi, ■ • • , vn), and 
(xi, . . . ,xn), we mean those signals described in assumptions (A3) 



or (A4). Let w 



(T) 



i < - T ' > and X: 



are the restrictions of w.Xi and 
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Xi to {0, ...,T}. We define M T G Rl N l x l N l to be a matrix of 
"noiseless" empirical moments, given by 

1 N 

4=1 

The following lemma demonstrates that the linearized empirical 
moments, under suitable assumptions, converge to these noiseless 
empirical moments. 

Lemma 3: Let w : Z+ -> 1™" and Xi : Z + -> 1™", for i G 
{1, . . . , TV}, satisfy assumptions (A1)-(A3). Then 

M St — M T A 

as T — > oo. 

The following statement justifies the use of this algorithm under 
the assumptions (A1)-(A3). 

Theorem 1: Let w : Z + -> R™" and Xj : Z+ ->• R" 1 , for i G 
{1, . . . , N}, satisfy assumptions (A1)-(A3). Fix 8, II, and re as in 
algorithm .4, and let (8t, vt, Rt) be the (8, f, R) found by applying 
algorithm A(S, II, re) to the data set Hy. If re is finite, then for every 
e > and 7 G (0, 1) there exists a T such that, with probability 

1 N 

e + tr(RM T ) > tv(R T M T ) > ^ E i| E (ae T , 2i(0), xf > , w (T) ), 

i=l 

holds for all (6, r, R) feasible for the optimization problem in step 

(hi) of A(8, n, k). 

The proof of the above theorem is in the appendix. 

The following result in characterizes the consistency of A in terms 
of recoverable functions. 

Theorem 2: Fix II : R 4 ™-+"™ _> R"n ; K G (0, 00), 8 > 0. 
Let {Ht}t=i be a sequence of data sets defined by signals w : 
Z+ R™ 1 " and £j : Z+ -> R" 1 , for i G {1, . . . ,iV}, satisfying 
assumptions (Al), (A2), and (A4), where the function ao : R" 1 x 
R n,u — > R"" described in assumption (A4) is in the recoverable 
set defined by ®(N,8,TT). If 9t is the parameter vector found by 
applying algorithm A(5, II, re) to Ht, then 

SU P {l e 9 T 1 (/flr( :r . w )) - a (x,w)\} A 

(a;,Uj)eK 

asT->oo for all compact sets if C R ?lx x R n " . 

The proof of this theorem and supporting lemmas are given in the 

appendix. 

V. Example 

This section examines the performance of algorithm A via a 
simple simulated example. Two alternative identification algorithms 
are introduced based on least square minimization; this is followed 
by a comparison of these algorithms and A on a simulated data set. 

A. Least Squares Identification Approaches 

Let $ = an d * — WiJi^i t> e nxe d sequences of 

polynomial functions, fa : R™* xR"» -> R n * and^i : R™" R" 1 . 
Let a; and A denote real vector variables of dimension n x . Given a 
positive constant 8 and function A : R 2n * -¥ R" A , let Sl(8, A) be 
the set of 8 G R" 8 such that there exists some E G R n A xii a suc h 
that: 

E > 0, 

2A'( e9 (s + A) - e 6 (x)) - 5| A| 2 = A([x; A])'EA([a?; A]). 

From the proof of Lemma [T] we see that G £1(5, A) guarantees ee 
is a bijection. 

The following two algorithms produce a parameter vector 9 from 
a data set S = (w, xi, ■ . . , ijv) G T>(n x ,n w , N, T). 



1) Least Squares Algorithm: Take S to be a 8 G fi((5, A) that 
minimizes 

N T 
i = l t=l 

The following algorithm adapts the least squares objective to use 
the linearized empirical moments p a { ) defined by l(5j with the aim 
of bias elimination. 

2) Modified Least Squares Algorithm: 

(i) Fix a 8 > and k G (0,oo], and let N = {a,}^ C Z™- 
be the smallest set of vector degrees such that for each 8 G G 
there exists coefficients c a G R satisfying e$(£) — fo(x,w) = 

(ii) Define M G Rl N l x l N l by [M] jlth = Aa^+a^ (H), and take 
M to be the projection of | (Af + M') on the cone of positive 
semidefinite matrices. 

(iii) Find the 8 G 0.(8, A), satisfying \8\ 2 < k, that minimizes 

tr ^88' Y^T'iMr^j , 
where T 4 : Rl N l xn « is defined so that [e e (0 - fe(x,w)]i = 

B. Simulated Example 

We provide a simple simulated example to compare the perfor- 
mance algorithm to the least squares algorithms defined above. We 
examine a SISO nonlinear output error data set generated in the fol- 
lowing fashion. The input is a scalar sequence u : {0, . . . , 800} n> R, 
generated as i.i.d. random variables distributed uniformly on [—1,1]. 
We examine the response of the system: 

1 K 1 Q 

x(t + 1) + -x(t + iy = -x(ty + 5u(t) 

O o 

starting from x(0) = 0. For i G {1, . . . , 10}, observed data 
sequences, Xi(t), are generated according to Xi(i) = x(t) + Vi(t) 
where the t>i(t) are i.i.d. zero mean Gaussian random variables 
with variance 0.09 and independent across trials and from the input, 
leading to a signal-to-noise ratio of approximately 25 dB. We take 
C = 1, $ to contain all monomials of degree less than or equal to 
five and \l/ to contain all monomials affine in u and of degree less 
than or equal to three in x. 

The identified models are computed on a subset of the available 
data revealing only the samples with t G {0, . . . ,Th} for each Tk = 
100 • 2 h for h G {0, 1,2,3} . The parameters 8 and re were taken 
to be 0.01 and 00 respectively, and the set K from the definition of 
the modified least squares algorithm was also used for algorithm A. 
The sum-of-squares programs were prepared using YALMIP [15|. 
The choices of II, as in algorithm A, and A, as in the modified least 
squares algorithm, that this software makes ensure that Q(N, 8, II) C 
Q(8, A), i.e. the modified least squares algorithm searches over a 
larger set of models than algorithm A. 

To validate the models we generate an additional input sequence 
u : {0, . . . ,800} h-> R, again i.i.d. uniformly distributed on [-1,1], 
that is independent from the training input, and noise. We compute 
the response x V ai(£) of the true system to this input from zero initial 
conditions and compute a normalized simulation error: 

^2\x, a (t)-x h (t)\ 2 /Y,\^a(t)\ 2 

t=0 t=0 

where Xh(t) is the response of the optimized model to the same 
input and starting from the same initial condition. These calculations 



7 



le-4 



le3 



le2 



lei 



le-1 
le-2 
le-3 
le^ 



42 

larger err 



le-4 
le3 
le2 
lei 



Al MLS LS 



T h =100 



le-1 



le-2 



le-3 



le^ 



31 

larger err 



4 



± 



Al MLS LS 



T h =200 



le-4 
le3 
le2 
lei 



le-1 
le-2 
le-3 
le-4 



14 

larger err 



i 



le-4 
le3 
le2 
lei 



Al MLS LS 



T h =400 



le-1 
le-2 
le-3 
le-4 



4 

larger err. 



+ 
+ 



Al MLS LS 



T h =800 



Fig. 2. Comparison of the algorithm A (A), the modified least squares algorithm (MLS) and the least squares algorithm (LS) for various training horizons. 
Plotted on a log scale is the distribution of the normalized simulation error for the validation input over 1,000 realizations. indicates the number of training 
samples. The vertical scale of 1 indicates identical performance to the model that always outputs zero (i.e. 100 percent simulation error) and le — 2 indicates 
1 percent simulation error. At the top of each plot is the number of MLS models having greater than 1000 percent simulation error. 



were performed for 1,000 independent realizations of the problem. 
Figure [2] plots a comparison of the models generated by the three 
algorithms 

As the amount of available data increases, the distribution of 
validation simulation errors tends toward zero for both algorithm 
A and the modified least squares approach. By contrast the result 
of the least squares approach without modification remains biased, 
though the variance of errors decreases. One sees that the modified 
least squares algorithm generates a large number of poorly performing 
models and generally under-performs algorithm A in terms of median 
as well. Note that the vertical scale in these plots is logarithmic: at 
Th = 200 the majority of the models rendered by algorithm A have 
less that 3 percent validation simulation error. 

VI. Conclusions 

In this paper we have presented a convex optimization approach 
to nonlinear system identification from noise-corrupted data. The 
main contributions are a particular convex family of stable nonlinear 
models, and a procedure for identifying models from data based on 
empirical moments. 

This builds upon previous work by the authors 1301 , (3j and 
1 1 8 1 and offers two main advantages over the previously proposed 
methods: the complexity of the computation does not grow with 



the length of the data, and the empirical moments can be "mixed" 
from different experiments to achieve a consistent estimator when 
the true system is in the model class. This is reminiscent of the 
instrumental variables approach to total least squares, although we 
suggest minimizing an alternative convex criterion. The advantages 
of the proposed method over least squares methods were illustrated 
via a simple simulated example. 

Appendix 
Proofs 

A. Modeling Scenario Satisfying (Al) and (A4) 

The section provides an example of a class of dynamical systems 
and inputs that result in signals satisfying conditions (Al) and (A4). 

Proposition 1: Let w(t) G R"™ be a vector valued i.i.d. stochastic 
process with w(t) having an absolutely continuous distribution with a 
lower semicontinuous density whose support is bounded and contains 
0. Let a : R" x x R nw be a real analytic function such that {j} 
with a = do is BIBO and incrementally exponentially stable and the 
linearization of ao about the unique equilibrium (a;, it) = (xo,0), 
with xq = ao(xo,0), is controllable. Then for any {xio}fLi C R' Ia; 



independent of w(t), the signal w(t) and the signals Xi 



Z 4 
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E" x , each defined by (T) with w(t) = w(t) and Xi(t) = x(t) with 
x{— 1) = Xio, satisfy (Al) and (A4). 

Proof. That w(t) satisfies (Al) is immediate. We use several results 
from |19|. The controllability assumptions given imply that the 
Markovian system generating the sequences (xi(t),w(t)) is weakly 
stochastically controllable (w.s.c) ( 1191 . Corollary 2.2). The BIBO 
stability of {T} with a = ao and boundedness of w(t) ensure 
(xi(t),w(t)) is bounded in probability. This boundedness in proba- 
bility, the w.s.c. of the system, and the conditions on the distribution 
of w(t) imply that the Markovian system generating (xi[t), w(t)) 
is a positive Harris recurrent Markov Chain. Thus there exists a 
unique stationary probability distribution n that is independent of 
initial condition. The w.s.c of the system and the assumptions on the 
distribution of w(t) ensure that the support of tv contains an open 
set. As the support of n is clearly bounded 



i T_1 r 

+ — n " 



hdn, 



(20) 



for all continuous h : I" 1 x R"" -> R n " (QD Proposition 2.3). 
From this we conclude that each pair (w,Xi) is persistently excited 
with respect to ao, as in Definition [T] □ 



where the sum runs over all k : {1, 
is defined by: 



||a||i} -> {0,1} and v K (t) 



We see: 



, T T Hay! 



Mi)- 



(22) 



(23) 



t=l i=l 

Condition ([8J of (A3) implies that limt^oo \zi(i) — Zj(i)\ = which, 
combined with the boundedness of Zi(t), allows us to conclude: 



lim 

T-KJO 



IT) -IT) 



0. 



(24) 



Each u K (t), for k / 0, is obtained as a multi-linear function 
of the noise sequences Vi(t), whose coefficients depend only on 
{zi{t)}f =1 . As the Vi(t) are zero mean, bounded, and independent 
from one another and each Zi(t), we see v K (t) is a zero mean, 
bounded random variable. In addition, condition {8} of (A3) and the 
boundedness of Xi(i) imply the correlation between v K (£) and v k (t) 
decays exponentially as \t — r\ — > oo, hence h Y^t=i u K(t) A 0. □ 



B. A Simple Recoverable Nonlinear System 
Let C = 1 and define a : E x E H> E by: 

e(a(x, w)) = ^x + b(w) 

where e(x) = \x + x 3 and b : E — > E is a polynomial. Let $ 
and \P be sequences of real analytic functions as in the definition of 
eg and fg such that there exists a 6 with e = eg and f — fg. Let 
x, £, A, g, w be real numbers, z — [{ ; a;; to] and « = [£; a;; tu; A; g]. 
With r(z) = 2(e(£) - /(a;, w)) 2 and P = 1, clearly is satisfied 
and the left hand side of the equality \l l\ can be expressed as 

(V2(e(0 - f(x,w)) - q/V2) 2 + (A - g/2) 2 

+ ^g 2 + (V6xA + ^3/2A 2 ) 2 + ^A 4 , 

i.e. as the sum of squares of polynomials. Similarly the left hand size 
of \\2\ can be expressed as: 

(A + g/2) 2 + ^g 2 + (V6xA + ^3/2A 2 ) 2 + i A 4 . 

These sum of squares decompositions show that there is a choice of 
H(v) such that matrices Si > and E2 > satisfying i | I 1 and 
l |12[ l are guaranteed to exist. For this IT and TV > 6, the polynomial 
r belongs to Vn so that a is in the recoverable set defined by 

0(AT,i,n). 

C. Proof of Lemma [3] 

Proof. By assumptions (Al) and (A3), the matrices Ma T are 
uniformily bounded in T with probability one. Since each Mr > 
and projection onto the positive semidefinite cone is a continuous 
function, this boundedness implies it is sufficient to show that 

fi a {~. T ) - li a (x ( P ,w {T) ) AO, 

as T -> 00, for each i G {1, . . . , TV} and a G Z™" with ||q||i < TV. 
Let Zi(i) = [xi{t);xi(t - l);w(t - 1)] and £»(*) = Zi(t) - Zj(t). 
For all a G Z™* with ||q||i < TV we have: 

IMIi 

n^(*)wo = z) "«(*)■ (2i) 

i=l re 



D. Proof of Theorem [7] 

Proof. Optimality of (9t,vt,Rt) implies that: 

tr(R{M ST -MT))+tr(RT(MT-M~ T ))+tiiRM T ) > tr(R T M T ) 

(25) 

for all feasible (8, r, R). From this, and the fact that k > \\R\\% for 
all feasible R, we see: 

2Vk||M t - Mt\\f + tr(A? T P) > tr(A 7 / T PT). (26) 

Fixing 7 G (0, 1), Lemma [5] guarantees there exists a T sufficiently 
large such that \\Mt — Mr |] f < with probability 1 — 7. Taking 
this T, the result follows by examining the definition of r(-), Mt 
and Lemma [T] □ 

£. Proof of Theorem [2] and Supporting Lemmas 

We begin by proving several supporting Lemmas. 

Lemma 4: For any feasible point (8, r, R) of the optimization of 
step (iii) in algorithm A, the function ag = e^ 1 o fg is real analytic. 
Furthermore, we have: 

r(z) > i|e«(£) - ,fe(x,™)| 2 > - a s (a;,ui)| 2 , (27) 

where z = [£; x; w], with ^, x G E' 1 * and to G E n "' . 

Proof. Examining {12} with x = xi and A = £1 — X2 we see that: 

2A'(eg(x + A)-eg{x)) > |A|| (28) 

for all a;, A G R" 1 . Letting = the above implies 

E(x) + E{x)' > P > SI for all x G E n - as e 9 (-) is continu- 
ously differentiable. Thus det(E(x)) / for all x G E" x , and 
as eg(£) — fg(x,w) is real analytic the Inverse Function Theo- 
rem for holomorphic maps ( 1101 Chapter I, Theorem 7.6) ensures 
eg 1 (fe{x, w)) is real analytic. 

The condition {TT} for x\ = X2 — x gives 

r{z) > \eg(0 - feix^)^-, > i|e e (|) - fe{x,w)\ 2 , (29) 

as P > SI. Furthermore, 

\ e e(0 - fe(x,w)\ 2 = \e e (£) - e g (ag(x, w))\ 2 
= \£-a e (x,w)\%, M , 
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where E = J* E(ae(x, w) + r(£ — ae(x, w)))dr. We see that 

£'£> S (E + E')--I> -I, 

where the first inequality follows from (E — ^I)'(E — |/) > and 
the second inequality from E being a convex combination of point 
evaluations of E(x). From this we can conclude {27\ □ 

Now we present the proof of Theorem [2] 

Proof. Fix any compact set K C K" x x M n ™ . By assumption there 
exists a (0o, ro, Ro), feasible for the optimization in step (iii) of 
algorithm A(S, II, n) such that do = ag and ro(ao(x, w), x, w) = 0. 
This implies tr(_R M T ) = 0, so that, by Lemma[3j tr{R Ms T ) A 
0. 

As (8o,ro, Ro) is in the feasible set of the optimization in step 
(iii) of A(5, II, k) and (9t,vt,Rt) is optimal, 

tr(iioM BT ) > tr(R T Ms T ), (30) 

so that tr(RTMs T ) A as well. Moreover, as ||-Rt||f < K > 

Vk\\M St - Mt\\f +tr(R T Ms T ) > tr(R T M T ), (31) 

by Cauchy-Scharwz. Lemma [3] now implies tr( RtMt) — > 0. 

As each tt is a polynomial and the matrices Rt are uniformily 
bounded, 

As each (sbi, w) is persistently exciting with respect to ao (Defini- 
tion[T|(, there exists of a positive measure n on the space R" 1 x R" U! , 
supported on an open set, such that for all e > there exists a 
To G Z+ with 



e|| Rt II f + ^(RtMt) > J rT([ao(x,w);x;w})dn(x,w), 

for all T > To. As || Rt\\f < f° r au T, this sequence of integrals 
converges to zero. Lemma [4] now implies 



Tt{\oo{x, w); x; w])dn(x, w) 



> 



\eg T (a (x,w)) - fg T (x,w)\ 2 d7Y(x,w). 



From that same lemma we see that the map (x,w) i-> 
eg T (ao(x,w)) — fe T (x,w) is real analytic. 
Let L be the subspace of R™" defined by 

L = {8\ = e e (ao(x,w))-fe(x,w), V (x,w) G R™ 1 x R' 1 ™}. 

Let I 7 be the quotient space V = R"" / L, i.e. the set of equivalence 
classes where 6 = if eg(ao(x, w)) — fe(x, w) = 0. The following 
functions are norms on V: 



\\0\\ = J J \e e (a (x,wj) - f g (x,w)\ 2 diY(x,w), 
\0\ao.u = sup {\e e (a (x,w)) - fg(x,w)\}, 

(x,m)eU 

where U is any bounded open set. Both functions are clearly bounded, 
homogeneous, sub-additive and positive for all 9. \\6\\ = \6\oo,u = 
when 8 = so that the functions are well-defined functions on V. 
As U is open and it is supported on an open set we see that 8^0 
implies both ||0|| > and |0|oo,t/ > 0, by the Identity Theorem 
for analytic maps 1101 . Now, fix U to be a bounded open set with 
U D K. As all norms are equivalent for finite dimensional spaces, 
we know there exists a constant c > such that c||0|| > |0|oo,ir. 



That O = implies ||0 T || A 0, so that \8 T \oo,u A 0. Lemma [4] 
then yields 
2 

—7=\9t\oo,u> sup {\eJ(fe T (x,w))-a (x,w)\}, 
Vo (x,w)eK 



which establishes the claim. 



□ 
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